Abstract


In this paper we describe two processor-efficient implementations of the Maximum Distance
Discharge algorithm for the maximum flow problem. Using $p = O(\sqrt{m})$ processors, the first
implementation runs in $O(n^2 \log(2m/n + p)(\sqrt{m}/p))$ time and uses $O(m + n \log n)$ space; the
second implementation runs in $O(n^2 \log n (\sqrt{m}/p))$ time and uses $O(m + p \log n)$ space. These
bounds are within a logarithmic factor of the $O(n^2 \sqrt{m})$ time and $O(m + n)$ space bounds of
the sequential Maximum Distance Discharge algorithm.


1  Introduction 

The maximum flow problem is a classical combinatorial optimization problem, which has been
widely studied in the context of sequential computation (see e.g. [7, 12]). Recently, parallel
algorithms for the problem have been studied as well. Although the problem is known to be P-complete
[14], significant speedups can be obtained by using a parallel algorithm for the problem, both in
theory and in practice [11].

The first parallel algorithm for the maximum flow problem is due to Shiloach and Vishkin [18].
This algorithm is based on the blocking flow method of Dinic [6] and runs in $O(n^2 \log n)$ time using
n processors and $O(n^2)$ memory.¹ In [10], the author introduced the first-in, first-out (FIFO)
algorithm that runs in the same time and processor bounds but uses $O(m)$ memory. This algorithm
is the first of the push-relabel maximum flow algorithms. The push-relabel method was developed
by Goldberg and Tarjan [13] as a generalization of it. The maximum distance discharge (MDD)
algorithm [13] is another variation of the generic push-relabel method. A parallel implementation
of this algorithm similar to that of the FIFO algorithm achieves the same asymptotic resource
bounds [13].

The original sequential running time bound for both the FIFO and the MDD algorithms was
$O(n^3)$. Cheriyan and Maheshwari [3] show that this bound is tight for the FIFO algorithm (i.e., the
algorithm requires $\Omega(n^3)$ time in the worst case), whereas for the MDD algorithm the bound can be
improved to $O(n^2 \sqrt{m})$. Thus an n-processor, $O(n^2 \log n)$-time parallel implementation is reasonable
for the FIFO algorithm, since the corresponding time-processor product is within a logarithmic factor
of the sequential time bound. However, such an implementation is not as good for the MDD
algorithm, since the time-processor product exceeds the sequential running time bound by a factor
of $O((n/\sqrt{m}) \log n)$, which is quite large for sparse graphs.

In this paper we describe an implementation of the MDD algorithm that runs in $O(n^2 \log n)$
time using $\sqrt{m}$ processors and $O(m + n \log n)$ memory. Using $p = O(\sqrt{m})$ processors, the
implementation runs in $O(n^2 \log(2m/n + p)(\sqrt{m}/p))$ time. A variation of this implementation that uses
the same number of processors and $O(m + p \log n)$ memory runs in $O(n^2 \log n (\sqrt{m}/p))$ time. These
are the best strongly polynomial bounds for a processor-efficient maximum flow algorithm. (If the
capacities are integers bounded by $U$, a parallel implementation of a scaling version of the
push-relabel method, due to Ahuja and Orlin [1], runs in $O(n^2 \log(U) \log n)$ time using $m/n$ processors
and $O(m)$ memory.) The techniques and data structures used in our implementation may be useful
for obtaining processor-efficient implementations of other graph algorithms.

¹Throughout this paper, n and m denote the number of vertices and the number of arcs in the input network.

Processor-efficient  algorithms  for  the  bipartite  matching  problem,  which  is  closely  related  to 
the  maximum  flow  problem,  are  discussed  in  [9]. 


2  Background 

We use the following definitions. Let $G = (V, E)$ be a directed graph with vertex set V of size n
and arc set E of size m. For ease in stating time bounds, we assume that $m > n$ and therefore
$\log(m/n) > 0$. Define $E^{-1} = \{(w, v) \mid (v, w) \in E\}$ and $E' = E \cup E^{-1}$. For any vertex w we
denote by $E(w)$ the set of vertices adjacent out from w, $E(w) = \{x \mid (w, x) \in E\}$, and by $E^{-1}(w)$
the set of vertices adjacent into w, $E^{-1}(w) = \{v \mid (v, w) \in E\}$. Graph G is a network if it has
two distinguished vertices, a source s and a sink t, and a nonnegative real-valued capacity $u(v, w)$
on every arc $(v, w)$. A preflow on a network is a nonnegative real-valued function f on the arcs
such that $f(v, w) \le u(v, w)$ for every arc $(v, w)$ and
$\sum_{v \in E^{-1}(w)} f(v, w) \ge \sum_{x \in E(w)} f(w, x)$ for every vertex $w \ne s$. The quantity
$e_f(w) = \sum_{v \in E^{-1}(w)} f(v, w) - \sum_{x \in E(w)} f(w, x)$ is called the excess at
vertex w. A preflow f is a flow if $e_f(w) = 0$ for every vertex $w \notin \{s, t\}$. A cost function
$c : E \to R$ assigns a cost to arcs of the network. We assume that costs are integers in the range
$[-C, \ldots, C]$.

The residual capacity of an arc $(v, w)$ with respect to a preflow f is $u_f(v, w) = u(v, w) - f(v, w)$.
Arc $(v, w)$ is saturated if $u_f(v, w) = 0$ and residual if $u_f(v, w) > 0$. The value of a flow f is the excess
of the sink $e_f(t)$. The maximum flow problem is to find a flow of maximum value.

To get slightly better running time bounds, we sometimes assume, without loss of generality,
that the maximum degree of a vertex in the input graph is at most $\Delta = 2m/n$. To justify this
assumption, consider an arbitrary graph and replace each vertex v with degree $\deg(v) > \Delta$ by
$k = \lceil \deg(v)/(\Delta - 2) \rceil$ vertices $v_1, \ldots, v_k$ connected in a ring with arcs of a very high capacity (e.g.
the sum of all original capacities). Distribute the arcs of v among $v_1, \ldots, v_k$ so that the degree of the
new vertices is bounded by $\Delta$. If the original graph has n vertices and m arcs, the transformed
graph has at most $2n$ vertices and $m + n$ arcs.

Our model of parallel computation is the concurrent-read, exclusive-write parallel random access
machine (CREW PRAM) [8]. We will use the fact that in this model, given a list of size $l$ and
$p \ge l$ processors, ranking the list, doing a parallel prefix computation on the list, and sorting the
list takes $O(\log l)$ time [4, 5, 15, 17].
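
As an illustration of the prefix primitive only, the following Python sketch simulates a simple doubling scheme in which each pass of the loop stands for one synchronous step. It is not the algorithm of [4, 5, 15, 17]: it performs more total work than an optimal scheme, and the names `parallel_prefix_sums` and `rank_requests` are hypothetical. It is meant to show why $O(\log l)$ rounds suffice and how prefix sums give each requesting processor a distinct rank.

```python
def parallel_prefix_sums(values):
    """Simulate an inclusive prefix sum computed by doubling.

    Each pass of the while loop models one synchronous step: position i
    reads position i - stride and writes only its own cell.
    Returns the prefix sums and the number of synchronous rounds used.
    """
    sums = list(values)
    rounds = 0
    stride = 1
    while stride < len(sums):
        previous = list(sums)  # snapshot: all reads happen before any write
        for i in range(stride, len(sums)):
            sums[i] = previous[i] + previous[i - stride]
        stride *= 2
        rounds += 1
    return sums, rounds


def rank_requests(flags):
    """Give each requesting processor (flag 1) a distinct 0-based rank."""
    sums, _ = parallel_prefix_sums(flags)
    return [s - 1 if flag else None for s, flag in zip(sums, flags)]
```

For a list of eight items the simulation uses three rounds, matching $\log_2 8$.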


push(v, w).

Applicability: v is active and $(v, w)$ is admissible.

Action: send $\delta \in (0, \min(e_f(v), u_f(v, w))]$ units of flow from v to w.

relabel(v).

Applicability: either s or t is reachable from v in $G_f$ and $\forall w \in V$, $u_f(v, w) = 0$ or $d(w) \ge d(v)$.
Action: replace $d(v)$ by $\min_{(v, w) \in E_f} \{d(w)\} + 1$.


Figure 1: The push and relabel operations.


3  The  Push  and  Relabel  Operations 

In this section we review the push and relabel operations. See [13] for more details.

To describe these operations, we need the following definitions. For a given preflow f, a distance
labeling is a function d from the vertices to the nonnegative integers such that $d(t) = 0$, $d(s) = n$,
and $d(v) \le d(w) + 1$ for all residual arcs $(v, w)$. We say that a vertex v is active if $v \notin \{s, t\}$ and
$e_f(v) > 0$. Note that a preflow f is a flow if and only if there are no active vertices. An arc $(v, w)$
is admissible if $(v, w) \in E_f$ and $d(v) = d(w) + 1$, where $E_f$ denotes the set of residual arcs.

The push-relabel method maintains a preflow f and a distance labeling d, which are modified
using the push and the relabel operations, respectively. A push from v to w increases $f(v, w)$ and
$e_f(w)$ by $\delta = \min\{e_f(v), u_f(v, w)\}$, and decreases $f(w, v)$ and $e_f(v)$ by the same amount. The
push is saturating if $u_f(v, w) = 0$ after the push and nonsaturating otherwise. A relabel operation,
applied to a vertex v, sets the label of v equal to the largest value allowed by the valid labeling
constraints. The basic operations are summarized in Figure 1.
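
The two operations can be sketched in Python as follows. This is a minimal illustration, not code from [13]: it assumes that the residual capacities are stored in a nested dictionary `residual[v][w]` that contains every arc together with its reverse, and it does not check that v is distinct from the source and the sink. The names `push`, `relabel`, `residual`, `excess`, and `label` are chosen for this example.

```python
def push(residual, excess, label, v, w):
    """Push flow along the admissible residual arc (v, w); return the amount."""
    assert excess[v] > 0, "v must be active"
    assert residual[v][w] > 0 and label[v] == label[w] + 1, "arc must be admissible"
    delta = min(excess[v], residual[v][w])
    residual[v][w] -= delta   # u_f(v, w) shrinks
    residual[w][v] += delta   # the reverse arc gains residual capacity
    excess[v] -= delta
    excess[w] += delta
    return delta


def relabel(residual, label, v):
    """Raise d(v) to the minimum of d(w) + 1 over residual arcs (v, w)."""
    targets = [label[w] for w, r in residual[v].items() if r > 0]
    assert targets and min(targets) >= label[v], "relabel does not apply"
    label[v] = min(targets) + 1
    return label[v]
```

On a path s, a, t with capacities 3 and 2, after the source arc is saturated, vertex a is relabeled to 1, pushes 2 units to t (a saturating push), is relabeled to 4, and returns the remaining unit to s.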

The generic push-relabel method initializes f and d and repeatedly performs an applicable push
or relabel operation. When no operation applies, the method terminates. During initialization, f
is set to the arc capacity on each arc leaving the source and zero on all arcs not incident to the
source. The distance labeling is initialized as follows: $d(s) = n$ and $d(v) = 0$ for $v \in V - \{s\}$. A
summary of the algorithm appears in Figure 2.

The generic method has the following properties [13]:


•  The algorithm always terminates with a maximum flow.

•  The number of relabel operations used is $O(n^2)$ and the total cost of these operations is $O(nm)$.

•  The number of saturating push operations and the total cost of these operations is $O(nm)$.

•  The number of nonsaturating push operations and the total cost of these operations is $O(n^2 m)$.
procedure generic(V, E, u);

[initialization]

$\forall (v, w) \in E$ do begin
    $f(v, w) \leftarrow 0$;
    if $v = s$ then $f(s, w) \leftarrow u(s, w)$;
    if $w = s$ then $f(v, s) \leftarrow -u(s, v)$;
end;

$\forall w \in V$ do begin
    $e_f(w) \leftarrow \sum_{(v, w) \in E} f(v, w)$;
    if $w = s$ then $d(w) = n$ else $d(w) = 0$;
end;

[loop]

while $\exists$ an active vertex do
    select an update operation and apply it;
return(f);
end.


Figure 2: The generic maximum flow algorithm.


discharge(v).

Applicability: v is active.

Action: while $e_f(v) > 0$ and v is not relabeled do
            if $\exists$ an admissible arc $(v, w)$
            then push(v, w)
            else relabel(v);


Figure 3: The discharge operation.


4  The  Maximum  Distance  Discharge  Algorithm 

The generic algorithm does not specify the order in which the basic operations are applied.
Some orderings, however, are more efficient than others. In this section we describe an ordering
of the operations that leads to an $O(n^2 \sqrt{m})$-time sequential algorithm. As we shall see later, this
algorithm has a substantial degree of parallelism. Since parallel algorithms are of main concern
here, we omit low-level details of the sequential algorithm in our description. These details can be
found in [12, 13].

The discharge operation, described in Figure 3, combines the basic operations locally (at a
vertex). The discharge operation is applicable to an active vertex v. This operation iteratively
reduces the excess at v by pushing it through admissible arcs going out of v if such arcs exist;
otherwise, discharge relabels v. The operation stops when the excess at v is reduced to zero or v is
relabeled. Note that discharge relabels v only when the relabel operation applies.

The second step to an efficient ordering of the basic operations is to restrict the order of processing
of active vertices. The MDD algorithm always selects for discharging an active vertex with the
largest label. The corresponding parallel algorithm processes all such vertices at once.

procedure process-vertex;
    remove a vertex v from $B_b$;
    old-label $\leftarrow d(v)$;
    discharge(v);
    add each vertex w made active by the discharge to $B_{d(w)}$;
    if $d(v) \ne$ old-label then begin
        $b \leftarrow d(v)$;
        add v to $B_b$;
    end
    else if $B_b = \emptyset$ then $b \leftarrow b - 1$;
end.


Figure 4: The process-vertex procedure.

The sequential implementation of the largest-label algorithm maintains an array of sets $B_i$,
$0 \le i \le 2n - 1$, and an index b into the array. Set $B_i$ consists of all active vertices with label
i, represented as a doubly-linked list, so that insertion and deletion take $O(1)$ time. The index b
is the largest label of an active vertex. During the initialization, active vertices are placed in $B_0$,
and b is set to 0. At each iteration, the algorithm removes a vertex from $B_b$, processes it using
the discharge operation, and updates b. The algorithm terminates when b becomes negative, i.e.,
when there are no active vertices. This processing of vertices, which implements the while loop of
the generic algorithm, is described in Figure 4.
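
The bucket-based sequential procedure can be sketched in Python as below. This is an illustrative sketch under several assumptions, not the implementation of [12, 13]: vertices are numbered 0 to n - 1, Python sets stand in for the doubly-linked lists, each arc is stored next to its reverse so that index `i ^ 1` is the reverse of arc `i`, and each discharge scans the whole arc list of the vertex. The function name `max_flow_mdd` and its helpers are hypothetical.

```python
def max_flow_mdd(n, arcs, s, t):
    """Largest-label push-relabel sketch; arcs is a list of (v, w, capacity)."""
    adj = [[] for _ in range(n)]   # adj[v]: indices of residual arcs leaving v
    head, res = [], []             # arc i and its reverse arc i ^ 1 are paired
    for v, w, cap in arcs:
        adj[v].append(len(head)); head.append(w); res.append(cap)
        adj[w].append(len(head)); head.append(v); res.append(0)

    d = [0] * n
    d[s] = n
    excess = [0] * n
    buckets = [set() for _ in range(2 * n)]   # B_0, ..., B_{2n-1}

    def send(v, i, delta):
        """Move delta units along residual arc i, which leaves v."""
        res[i] -= delta
        res[i ^ 1] += delta
        excess[v] -= delta
        excess[head[i]] += delta

    def activate(v):
        if v != s and v != t and excess[v] > 0:
            buckets[d[v]].add(v)

    for i in adj[s]:               # initialization: saturate arcs leaving s
        send(s, i, res[i])
    for v in range(n):
        activate(v)

    b = 0
    while b >= 0:
        if not buckets[b]:
            b -= 1
            continue
        v = buckets[b].pop()       # process-vertex: remove a vertex from B_b
        relabeled = False
        while excess[v] > 0 and not relabeled:   # discharge(v)
            for i in adj[v]:
                if res[i] > 0 and d[v] == d[head[i]] + 1:   # admissible arc
                    send(v, i, min(excess[v], res[i]))
                    activate(head[i])
                    if excess[v] == 0:
                        break
            else:                  # excess remains, no admissible arc: relabel
                d[v] = 1 + min(d[head[i]] for i in adj[v] if res[i] > 0)
                relabeled = True
        if relabeled:
            b = d[v]
            buckets[b].add(v)
    return excess[t]
```

The function returns the excess at the sink, which is the value of the flow when the loop ends.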

We define a phase of the algorithm as a maximal time interval during which b remains constant.
The notion of phase is important both for the sequential and the parallel analysis of the algorithm.

Lemma 4.1 [13] The number of phases during an execution of the MDD algorithm is $O(n^2)$.

The following theorem gives the sequential running time bound for the algorithm.

Theorem 4.2 [3] The MDD algorithm runs in $O(n^2 \sqrt{m})$ time.

Lemma 4.1 suggests a parallel version of the algorithm, where all largest-labeled active vertices
are processed in parallel. The running time of the resulting algorithm is $O(n^2)$ times the time needed
for the parallel processing of the vertices. The next section describes such an implementation.


5  A  Processor-Efficient  Parallel  Implementation 

Straightforward implementations of parallel maximum flow algorithms, described in [11, 18], use
a linear number of processors. Shiloach and Vishkin [18] show that their algorithm can be
implemented with n processors and $O(n^2)$ space and still achieve the $O(n^2 \log n)$ time bound. In this
section we extend their techniques to obtain a parallel implementation of the MDD algorithm that
uses $\sqrt{m}$ processors, runs in $O(n^2 \log n)$ time, and needs $O(m + n \log n)$ space. Using $p = O(\sqrt{m})$
processors, the implementation runs in $O(n^2 \log(\Delta + p)(\sqrt{m}/p))$ time. The time-processor product
of this parallel implementation is within a logarithmic factor of the number of operations of the
underlying sequential method. A variation of this implementation achieves a slightly better space
bound at the expense of a slightly worse time bound.

To  obtain  a  processor-efficient  implementation  of  the  algorithm,  we  have  to  provide  a  mechanism 
for  assigning  the  work  to  processors  in  such  a  way  that  most  processors  are  busy  most  of  the  time. 
Doing  this  scheduling  “on-line”  is  the  biggest  problem  our  implementation  has  to  overcome. 

The implementation maintains the sets $B_i$ of active vertices with the distance label i, for
$0 \le i \le 2n - 1$. The index b is maintained as in the sequential implementation. The sets are
maintained so that in $O(\log p)$ time, several processors can add a vertex each to the sets, and
several elements of $B_b$ can be assigned to different processors and removed from the set.

A straightforward way to implement these operations is to use an array of length n for each
set. The elements of $B_i$ occupy the first $|B_i|$ locations of the corresponding array. The processors
that want to add elements to $B_i$ are enumerated and add their elements to the array position
determined by their rank and $|B_i|$; after this is done, $|B_i|$ is updated. To assign elements of $B_b$ to a
set of processors of size $k \le |B_b|$, the processors are ranked and assigned elements starting from the
end of the corresponding array. Then $|B_b|$ is decreased by k. This straightforward implementation
meets the desired time bound but uses $O(n^2)$ space.
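
A minimal Python sketch of this array scheme is given below, with each loop iteration standing for one processor. It assumes that the ranks have already been computed (here a rank is simply the position of a request in the list), and the names `parallel_add` and `parallel_remove` are hypothetical.

```python
def parallel_add(array, count, requests):
    """Store one element per requesting processor; return the new set size.

    Processor `rank` writes to cell count + rank, so distinct ranks never
    write to the same cell (exclusive write).
    """
    for rank, element in enumerate(requests):
        array[count + rank] = element
    return count + len(requests)


def parallel_remove(array, count, k):
    """Assign the last k elements to k ranked processors; return them and the new size."""
    assert k <= count
    assigned = [array[count - 1 - rank] for rank in range(k)]
    return assigned, count - k
```

Because every processor touches only the cell determined by its own rank, the cost of one such operation is dominated by the ranking step.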

To reduce the space requirement, we take advantage of the fact that the total number of elements
in all sets $B_i$ is at most n. We discuss two parallel data structures that can be used to maintain
the sets $B_i$ in a space-efficient way.

One such data structure is a dynamic array. A dynamic array consists of an ordered list of
segments. If the dynamic array contains k elements, the number of segments in the list is $\lceil \log k \rceil + 1$.
Each segment is an array; the length of the first two segments is 1, and for $j > 2$, the length of the
jth segment is $2^{j-2}$. Note that more than half of the space allocated to the nonempty segments is
actually used and all segments are of size at most n. Using this observation, it is easy to see that
the total space required is $O(n \log n)$.
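
The segment layout can be checked with a small Python sketch. It assumes segment lengths 1, 1, 2, 4, 8, ... and reads the segment count as the ceiling of $\log_2 k$ plus one, which is the reading under which k elements always fit; the helper names `segment_count`, `segment_length`, and `locate` are introduced only for this example.

```python
def segment_count(k):
    """Number of segments needed for k >= 1 elements: ceil(log2 k) + 1."""
    return (k - 1).bit_length() + 1


def segment_length(j):
    """Length of the j-th segment (1-based): 1, 1, 2, 4, 8, ..."""
    return 1 if j <= 2 else 2 ** (j - 2)


def locate(i):
    """Map a 0-based element index to (segment number, offset in segment)."""
    j = i.bit_length() + 1
    start = 0 if j == 1 else 2 ** (j - 2)   # elements stored before segment j
    return j, i - start
```

For example, five elements occupy four segments with total capacity 8, so more than half of the allocated space is in use.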

We can implement the lists of segments by arrays of pointers to the segments. These arrays take
$O(n \log n)$ space. The time required for $l$ processors to add elements to the sets $B_i$ or to remove $l$
elements from $B_b$ is $O(\log l)$ (using sorting or ranking of processors operating on the same set).

Alternatively, we can use the parallel 2-3 tree data structure described in [16]. This data
structure allows addition (deletion) of $l$ elements by $l$ processors to (from) a set of k elements in
$O(\log k + \log l)$ time and $O(k + l \log k)$ space. In our application $k \le n$ and $l \le p = O(n)$, so the
bounds can be rewritten as $O(\log n)$ time and $O(n + p \log n)$ space.

Our implementation of the MDD algorithm works in iterations, each of which implements a
pass of the sequential algorithm. Each iteration is divided into three stages. During the first stage
flow is pushed out of the active vertices in $B_b$. During the second stage this flow is collected at the
destination vertices. The last stage relabels the appropriate vertices.

For the purpose of scheduling, one has to keep track of the number of processors needed to
perform a relabel or a discharge operation. In the sequential algorithm, the number of steps required
to relabel a vertex v is linear in the degree of v, so in the parallel implementation we assign to this
operation a number of processors equal to the degree of v. Discharging a vertex v requires a
number of processors equal to the number of pushes performed during the discharge. We maintain
a data structure at each vertex that allows fast computation of this number by a single processor.
This data structure is also used for pushing the flow.

The data structure we use is a variant of the partial sum tree data structure [18]. A partial sum
tree is a balanced binary tree with leaves corresponding to edges adjacent to a vertex. Each vertex
v has two trees associated with it, the out-tree(v), which is used to push flow out of the vertex, and
the in-tree(v), which is used to collect flow pushed into the vertex.

The out-trees are used in the first stage. Leaves of the out-tree(v) correspond to the arcs $(v, w)$.
Each node x of the tree has two labels, $a(x)$ and $b(x)$. The label values are defined as follows. If x
is a leaf corresponding to the arc $(v, w)$, then

$$a(x) = \begin{cases} u_f(v, w) & \text{if } (v, w) \text{ is admissible} \\ 0 & \text{otherwise} \end{cases}$$

and

$$b(x) = \begin{cases} 1 & \text{if } (v, w) \text{ is admissible} \\ 0 & \text{otherwise.} \end{cases}$$

If x is not a leaf, then $a(x)$ and $b(x)$ are equal to the sums of the corresponding values of the
children of x. In other words, $a(x)$ is equal to the sum of residual capacities of the admissible arcs
corresponding to the leaves of the subtree rooted at x, and $b(x)$ is the number of such admissible
arcs.

Suppose v is an active vertex to be discharged. First a processor is assigned to v to determine
the number of pushes $p(v)$ which will be made out of v. Using the a and b values of out-tree(v) and the
value of $e_f(v)$, this can be done in $O(\log \Delta)$ time.
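
One way to carry out this computation is a single root-to-leaf descent, sketched below in Python. The sketch makes assumptions that the text does not fix: the tree is stored in heap order in two arrays with the root at index 1, the leaf count is padded to a power of two, and pushes are made along admissible arcs in leaf order. The names `build_out_tree` and `count_pushes` are hypothetical.

```python
def build_out_tree(arcs):
    """arcs: list of (residual_capacity, is_admissible) in leaf order.

    Returns (a, b, size): heap-ordered arrays with the root at index 1 and
    leaf i at index size + i; unused padding leaves carry a = b = 0.
    """
    size = 1
    while size < len(arcs):
        size *= 2
    a = [0] * (2 * size)
    b = [0] * (2 * size)
    for i, (residual, admissible) in enumerate(arcs):
        if admissible:
            a[size + i] = residual
            b[size + i] = 1
    for x in range(size - 1, 0, -1):
        a[x] = a[2 * x] + a[2 * x + 1]
        b[x] = b[2 * x] + b[2 * x + 1]
    return a, b, size


def count_pushes(a, b, size, excess):
    """Number of pushes needed to discharge a positive excess."""
    if a[1] <= excess:            # every admissible arc gets saturated
        return b[1]
    x, remaining, count = 1, excess, 0
    while x < size:               # invariant: a[x] >= remaining > 0
        left = 2 * x
        if a[left] >= remaining:
            x = left
        else:                     # all admissible arcs on the left saturate
            remaining -= a[left]
            count += b[left]
            x = left + 1
    return count + 1              # the arc at this leaf receives the last push
```

The descent visits one node per level, so its cost is proportional to the height of the tree.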

To push the flow out of v, we assign $p(v)$ processors to v. Then we associate each processor
with the arc it will push the flow through. To do this, we rank the processors assigned to v,
which takes $O(\log p)$ time. Then each processor goes down the tree starting from the root and
picking at each step the left or the right child of its current node x depending on the processor's
rank and on the b values of the children of x. At the end, the ith processor will be at the leaf
corresponding to the ith admissible arc of v. Note that this process requires concurrent read. Then
each of the processors computes the amount $\delta(v, w)$ to be pushed along the arc corresponding to
the processor. For all but the last processor assigned to v, this amount is equal to $u_f(v, w)$, since
the corresponding pushes are saturating. For the last processor, this amount is equal to $u_f(v, w)$ if
$a(\text{root}(\text{out-tree}(v))) \le e_f(v)$ and to the excess remaining at v after the preceding saturating
pushes otherwise. The processors update the flow function on the corresponding arcs and the last
processor updates $e_f(v)$.
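
The descent by rank can be illustrated with the following Python sketch. It is a sequential stand-in for the parallel step: the loop over ranks replaces the processors, and the last amount is obtained from a running remainder of the excess. The heap-ordered array layout and the names `build_counts`, `select_arc`, and `push_amounts` are assumptions of this example.

```python
def build_counts(admissible):
    """Heap-ordered b values: b[x] counts admissible leaves below node x."""
    size = 1
    while size < len(admissible):
        size *= 2
    b = [0] * (2 * size)
    for i, flag in enumerate(admissible):
        b[size + i] = 1 if flag else 0
    for x in range(size - 1, 0, -1):
        b[x] = b[2 * x] + b[2 * x + 1]
    return b, size


def select_arc(b, size, rank):
    """Leaf index of the rank-th admissible arc (rank is 1-based)."""
    assert 1 <= rank <= b[1]
    x = 1
    while x < size:
        left = 2 * x
        if rank <= b[left]:
            x = left
        else:
            rank -= b[left]       # skip the admissible arcs on the left
            x = left + 1
    return x - size


def push_amounts(residual, admissible, excess, pushes):
    """(leaf, amount) pairs for processors 1..pushes, in rank order."""
    b, size = build_counts(admissible)
    amounts, remaining = [], excess
    for rank in range(1, pushes + 1):   # one iteration per processor
        leaf = select_arc(b, size, rank)
        delta = min(residual[leaf], remaining)
        amounts.append((leaf, delta))
        remaining -= delta
    return amounts
```

With residual capacities 5, 7, 5, 5, of which the second arc is not admissible, and an excess of 7, two processors are used: the first saturates arc 0 and the second pushes the remaining 2 units along arc 2.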

Next the out-trees are updated going from the bottom level of the tree up. Initially each
processor updates the a and b labels at the leaf assigned to it. Then the processor decides if it will
stop updating or not. The processor stops the update only if it is currently at the root of the tree
or if it is at a tree node x which is a right son of its father y, and the left son of y has just been
updated, i.e., also has a processor working on it. The process is repeated until the root of the tree
is reached and updated.

In the second stage, the flow pushed into vertices w is collected using the in-trees. Leaves of
in-tree(w) correspond to arcs entering w. A processor that was assigned to the arc $(v, w)$ when
pushing flow out of v is assigned to the same arc when processing the flow pushed into w. Every
node x of in-tree(w) has a variable $a'(x)$ associated with it. If x is a leaf corresponding to an arc
$(v, w)$, then $a'(x)$ is set to the amount equal to that just pushed along the arc. If x is not a leaf,
then $a'(x)$ is equal to the sum of the values of the corresponding variables of its children. The
values of the $a'$ variables are propagated going from the leaves to the root in the same way as the
values of the a variables of the out-trees. The update takes $O(\log \Delta)$ time. After the update, $e_f(w)$ is
increased by $a'(\text{root}(\text{in-tree}(w)))$. Then the values of the $a'$ variables are reinitialized to 0 by making
another leaves-to-root pass.

Relabelings are implemented by maintaining an array of vertices to be relabeled. (Note that
this array has at most n items.) Vertices that were unable to get rid of their excesses during a
discharge are added to the end of the array using ranking of the processors that want to add the
vertices to the array. During the relabeling stage, processors are assigned to the last p elements
of the array or to every element of the array if there are fewer than p elements in it. Using parallel
prefix computations on the portion of the array for which the processors have been assigned, vertices
needing relabeling are assigned a number of processors equal to their degrees, and the relabeling
is performed by doing a parallel prefix computation on the edge list of every vertex that needs to be
relabeled.

The analysis is based on the following theorem of Brent.

Theorem 5.1 [2] Any synchronized parallel algorithm of depth d that consists of a total of x elementary
operations can be implemented by p processors within a depth of $\lceil x/p \rceil + d$.

Let macro-operations be any standard unit-time PRAM operation plus sequences of operations
performed by individual processors while working on an in-tree or an out-tree or performing an
operation of ranking, sorting, or computing parallel prefix. Note that the macro-operations used
by the algorithm take $O(\log(\Delta + p))$ time. By Lemma 4.1, the depth of the algorithm is $O(n^2)$
macro-operations. The total number of macro-operations used by the algorithm is $O(n^2 \sqrt{m})$. Applying
Theorem 5.1 for $p = O(\sqrt{m})$ and using the fact that each macro-operation takes $O(\log(\Delta + p))$
time, we get the following result.
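
Spelled out, the calculation is short. The following restates the step under the assumptions already made in the text, namely $x = O(n^2 \sqrt{m})$ macro-operations, depth $d = O(n^2)$, and $p = O(\sqrt{m})$ processors, so that $\sqrt{m}/p$ is bounded below by a positive constant:

```latex
\begin{align*}
\left\lceil \frac{x}{p} \right\rceil + d
  &= O\!\left(\frac{n^2 \sqrt{m}}{p} + n^2\right)
   = O\!\left(n^2 \cdot \frac{\sqrt{m}}{p}\right)
   && \text{macro-operations, since } \sqrt{m}/p = \Omega(1), \\
T &= O\!\left(n^2 \log(\Delta + p) \cdot \frac{\sqrt{m}}{p}\right)
   && \text{time, with } \Delta = 2m/n .
\end{align*}
```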


Theorem 5.2 The parallel implementations of the MDD algorithm run in $O(n^2 \log(2m/n + p)(\sqrt{m}/p))$
time using $p = O(\sqrt{m})$ processors and $O(m + n \log n)$ memory, or in $O(n^2 \log n (\sqrt{m}/p))$ time using
$p = O(\sqrt{m})$ processors and $O(m + p \log n)$ memory.

Note that the time bounds in the above theorem exceed those with optimal processor
utilization by a factor of $O(\log n)$. The amount of space required by the implementations is (slightly)
superlinear. The latter fact is due to the space requirements of the dynamic array or parallel 2-3
tree data structures.

We conclude with the following open question: Can the MDD algorithm be implemented to run
in $O(n^2 \log n (\sqrt{m}/p))$ time using $p = O(\sqrt{m})$ processors and a linear amount of memory?